The data itself - Guns/Violence in America.
Gun violence is an ongoing problem in modern America. A lot of people are divided on the issue of gun control, however the numbers speak for themselves. Every year nearly 40,000 Americans are killed by guns, including more than 23,000 who die by firearm suicide, 14,000 who die by firearm homicide, more than 500 who die by legal intervention, nearly 500 who die by unintentional firearm injuries, and more than 300 who die by undetermined intent.
First, let me explain the variables + the data set
This is a data frame containing 1,173 observations on 13 variables. The following variables consist of :
state : factor indicating state.
year:factor indicating year
violent : violent crime rate (incidents per 100,000 members of the population).
murder : murder rate (incidents per 100,000).
robbery : robbery rate (incidents per 100,000).
prisoners : incarceration rate in the state in the previous year (sentenced prisoners per 100,000
residents; value for the previous year).
afam : percent of state population that is African-American, ages 10 to 64.
cauc : percent of state population that is Caucasian, ages 10 to 64.
male : percent of state population that is male, ages 10 to 29.
population : state population, in millions of people.
income : real per capita personal income in the state (US dollars).
density: population per square mile of land area, divided by 1,000.
law : factor. Does the state have a shall carry law in effect in that year?
In this dataset we can see the data for violence, murder, robberies, priosners, and some races of each state’s population. There are also other variables we can explore in this dataset for further analyzation of what causes these crimes in certain states by making correlations, without implying causation.Furthermore, in this lecture, I am going to breakdown this data by most pertaining factors that are an issue around gun laws/gun control.
The main point of this lecture is to learn how to manipulate and clean data. This can be used through the library, dplyr. Dplyr is arguably one of the most important libraries on R. It can be used for data cleaning, manipulation, and creation of new variables.
Objective - When working with data you must:
Figure out what you want to do.
Describe those tasks in the form of a computer program.
Execute the program.
Execution - -In order to perform the dplyr we need to learn the syntax. -You must use %>% to transform datasets. - For example :
If I had a dataset called cars and I wanted to transform it I would do the following
Cars_2 <- Cars %>% rename(#one of the most common )%>% #apply it again! mutate%>% filter%>% arrange%>% select
Cars_2
These are some of the most common uses of the dplyr package, and why we will be using it intensively today.
1 - Introduction 2 - Understanding the data further 3 - Preparation 4 - Reading in the data 5 - Data cleaning 6 - Getting key numbers - summaries, means, medians of the data 7 - plots (ggplots) + explanations plus significance 8 - US map plot (new library) 9 - More statistics based on plots 10 - Correlations 11 - More! 12 - Conclusion 12- Takeaways
Each observation is a given state in a given year. There are a total of 51 states times 23 years = 1,173 observations. This data set is between the years 1977–1999. Although somewhat old, it is still accurate to numbers today.
2 - Preparation! Reading in the necessary libraries
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
We do this by using read.csv and making sure the directory is correct
Guns <- read.csv('/Users/rishisiddharth/Desktop/Data_412/DataSets/guns.csv')
There are many other methods/ways to data clean given the dataset, but this dataset is good for the majority of things. For this dataset all we need to do in rename variables and to create new varirables
4a -renaming certain variables for further clarification
Guns <- Guns %>%
rename(africanAmerican = afam, violence = violent, caucasian = cauc)
head(Guns, n = 20)
## rownames year violence murder robbery prisoners africanAmerican caucasian
## 1 1 1977 414.4 14.2 96.8 83 8.384873 55.12291
## 2 2 1978 419.1 13.3 99.1 94 8.352101 55.14367
## 3 3 1979 413.3 13.2 109.5 144 8.329575 55.13586
## 4 4 1980 448.5 13.2 132.1 141 8.408386 54.91259
## 5 5 1981 470.5 11.9 126.5 149 8.483435 54.92513
## 6 6 1982 447.7 10.6 112.0 183 8.514000 54.89621
## 7 7 1983 416.0 9.2 98.4 215 8.545608 54.83936
## 8 8 1984 431.2 9.4 96.1 243 8.559511 54.77876
## 9 9 1985 457.5 9.8 105.4 256 8.562801 54.67899
## 10 10 1986 558.0 10.1 111.6 267 8.566521 54.51791
## 11 11 1987 559.2 9.3 112.2 283 8.592103 54.38770
## 12 12 1988 558.6 9.9 117.8 307 8.618144 54.23505
## 13 13 1989 590.8 10.2 133.9 300 8.638031 54.06622
## 14 14 1990 708.6 11.6 143.7 328 8.699674 56.07016
## 15 15 1991 844.2 11.5 152.8 370 8.771641 55.97353
## 16 16 1992 871.7 11.0 164.9 394 8.877969 55.80952
## 17 17 1993 780.4 11.6 159.5 407 8.972758 55.66076
## 18 18 1994 683.7 11.9 171.2 431 9.047583 55.50783
## 19 19 1995 632.4 11.2 185.8 450 9.094921 55.33187
## 20 20 1996 565.4 10.0 167.0 471 9.149420 55.31420
## male population income density state law
## 1 18.17441 3.780403 9563.148 0.0745524 Alabama no
## 2 17.99408 3.831838 9932.000 0.0755667 Alabama no
## 3 17.83934 3.866248 9877.028 0.0762453 Alabama no
## 4 17.73420 3.900368 9541.428 0.0768288 Alabama no
## 5 17.67372 3.918531 9548.351 0.0771866 Alabama no
## 6 17.51052 3.925229 9478.919 0.0773185 Alabama no
## 7 17.35089 3.934103 9783.000 0.0774933 Alabama no
## 8 17.11902 3.951826 10357.200 0.0778424 Alabama no
## 9 16.85875 3.972520 10725.860 0.0782500 Alabama no
## 10 16.57609 3.991562 11091.620 0.0786251 Alabama no
## 11 16.28230 4.015257 11323.820 0.0790919 Alabama no
## 12 15.99270 4.023848 11654.960 0.0792611 Alabama no
## 13 15.67523 4.030224 11963.900 0.0793867 Alabama no
## 14 15.38070 4.048508 12063.980 0.0797736 Alabama no
## 15 15.18314 4.091025 12087.820 0.0806113 Alabama no
## 16 15.02558 4.139269 12398.020 0.0815620 Alabama no
## 17 14.86296 4.193114 12395.800 0.0826229 Alabama no
## 18 14.67744 4.232965 12673.920 0.0834082 Alabama no
## 19 14.54549 4.262731 12872.680 0.0839947 Alabama no
## 20 14.41802 4.290403 12908.910 0.0845400 Alabama no
4b - Creating new variables we can rename the columns for them to make more sense
# Creating abbreviation for states
Guns <- Guns %>%
mutate(states_abb = substr(state, 1, 2))
we can mutate certain variables to make new variables, like we did here
We can calculate things such murders per state
Guns <- Guns %>% #same dataframe
group_by(state) %>% #group by state to make it clear
mutate(murder_per_state = mean(murder, na.rm = TRUE)) %>% #mutate, and create the new variable with the mean murder amount
ungroup()
Keep in mind this is throughout all 22 years! By grouping the state, we can get the amount of murders per state during the 22 years
# Calculating the violence per state, same concept
Guns <- Guns %>%
group_by(state) %>%
mutate(violence_per_state = mean(violence, na.rm = TRUE)) %>%
ungroup()
#if we do the same thing, we can get the mean amount of robberies during the 22 years!
# Calculating robberies per state
Guns <- Guns %>%
group_by(state) %>%
mutate(robberies_per_state = mean(robbery, na.rm = TRUE)) %>%
ungroup()
#same thing for robberies! This gives us an insight to which state has the most murders, robbery, prisoners
# Calculating prisoners per state
Guns <- Guns %>%
group_by(state) %>%
mutate(prisoners_per_state = mean(prisoners, na.rm = TRUE)) %>%
ungroup() # Ungrouping after the operation
Guns
## # A tibble: 1,173 × 19
## rownames year violence murder robbery prisoners africanAmerican caucasian
## <int> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 1977 414. 14.2 96.8 83 8.38 55.1
## 2 2 1978 419. 13.3 99.1 94 8.35 55.1
## 3 3 1979 413. 13.2 110. 144 8.33 55.1
## 4 4 1980 448. 13.2 132. 141 8.41 54.9
## 5 5 1981 470. 11.9 126. 149 8.48 54.9
## 6 6 1982 448. 10.6 112 183 8.51 54.9
## 7 7 1983 416 9.2 98.4 215 8.55 54.8
## 8 8 1984 431. 9.4 96.1 243 8.56 54.8
## 9 9 1985 458. 9.8 105. 256 8.56 54.7
## 10 10 1986 558 10.1 112. 267 8.57 54.5
## # ℹ 1,163 more rows
## # ℹ 11 more variables: male <dbl>, population <dbl>, income <dbl>,
## # density <dbl>, state <chr>, law <chr>, states_abb <chr>,
## # murder_per_state <dbl>, violence_per_state <dbl>,
## # robberies_per_state <dbl>, prisoners_per_state <dbl>
Same thing here. We can see here that D.C ironically has the most murders, most violence, robberies, and prisoners according to this dataset. Although we found the state with the most for these varirables, we could make it easier on ourselves instead of creating these columns and scrolling through the data manually, we can….
However - Before that lets discuss in further data preparation.
Data preparation - Some struggles I faced were figuring out how I needed to clean/manipulate the data in order to perform the stastical test as I wanted to. In all my years of performing statistical tests there has been one thing I have learned; have a plan. My plan for this dataset forced me to think of my questions of the dataset at the very beginning, which allowed me to create a plan, thus being able to execute the first, and most important steps: data manipulation. If one were to mess this step up, the statistical performance wouldn’t be most optimal, and the overall project would lack consistency.
# We can create a new dataframe called Guns_summary
Guns_summary <- Guns %>%
group_by(state, year) %>%
summarise(
average_violent = mean(violence, na.rm = TRUE), # Replace 'violence' with the correct column name if different
average_murder = mean(murder, na.rm = TRUE),
average_robbery = mean(robbery, na.rm = TRUE),
average_prisoners = mean(prisoners, na.rm = TRUE),
.groups = 'drop' # This will ungroup the data after summarisation
)
Guns_summary
## # A tibble: 1,173 × 6
## state year average_violent average_murder average_robbery average_prisoners
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Alaba… 1977 414. 14.2 96.8 83
## 2 Alaba… 1978 419. 13.3 99.1 94
## 3 Alaba… 1979 413. 13.2 110. 144
## 4 Alaba… 1980 448. 13.2 132. 141
## 5 Alaba… 1981 470. 11.9 126. 149
## 6 Alaba… 1982 448. 10.6 112 183
## 7 Alaba… 1983 416 9.2 98.4 215
## 8 Alaba… 1984 431. 9.4 96.1 243
## 9 Alaba… 1985 458. 9.8 105. 256
## 10 Alaba… 1986 558 10.1 112. 267
## # ℹ 1,163 more rows
#First, we can
# Find the state with the most murders, ensure the data is ungrouped
state_most_murders <- Guns_summary %>%
ungroup() %>% # Remove any existing grouping
arrange(desc(average_murder)) %>%
slice_head(n = 1)
#state with the most prisoners, ensure the data is ungrouped
state_most_prisoners <- Guns_summary %>%
ungroup() %>% # Remove any existing grouping
arrange(desc(average_prisoners)) %>%
slice_head(n = 1)
#state withe most robberiers
state_most_robberies <- Guns_summary %>%
ungroup() %>% # Remove any existing grouping
arrange(desc(average_robbery)) %>%
slice_head(n = 1)
# Printing the state with the most murders
print("State with the most murders:")
## [1] "State with the most murders:"
print(state_most_murders)
## # A tibble: 1 × 6
## state year average_violent average_murder average_robbery average_prisoners
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Distri… 1991 2453. 80.6 1216. 1148
# Printing the state with the most prisoners
print("State with the most prisoners:")
## [1] "State with the most prisoners:"
print(state_most_prisoners)
## # A tibble: 1 × 6
## state year average_violent average_murder average_robbery average_prisoners
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Distri… 1999 1628. 46.4 644. 1913
#printing the state with most robberies
print("State with the most robberies:")
## [1] "State with the most robberies:"
print(state_most_robberies)
## # A tibble: 1 × 6
## state year average_violent average_murder average_robbery average_prisoners
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Distri… 1981 2275. 35.1 1635. 426
As seen in the print statements, it’s confirmed D.C unfortunately the most murders, robberies, and prisoners, now, we can addd a column to indicate the time period (since it’s only one period, all values will be the same).
Guns <- Guns %>%
mutate(period = "1977-1999")
Firstly, notice how we are still using dplyr throughout this lecture, even if we aren’t cleaning/manipulating the data.
1.creating a bar graph that compares each state based on murders, robberies 2. sum of the murders and robberies for each state
Guns_summed <- Guns %>%
group_by(state) %>%
summarise(murder_sum = sum(murder, na.rm = TRUE),
robbery_sum = sum(robbery, na.rm = TRUE),
prisoner_sum = sum(prisoners,na.rm = TRUE))
Now create the bar graph with ggplot
barplt <- ggplot(data = Guns_summed, aes(x = state)) +
geom_bar(aes(y = murder_sum), stat = "identity", position = "dodge", fill = "blue") +
geom_bar(aes(y = prisoner_sum), stat = "identity", position = "dodge", fill = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(barplt)
After a thorough analysis of the bar graph, it confirms that D.C has the
highest murder rate in these years.
plot2 <- ggplot(data = Guns_summed, aes(x = state)) +
geom_bar(aes(y = murder_sum), stat = "identity", position = "dodge", fill = "blue") +
geom_bar(aes(y = robbery_sum), stat = "identity", position = "dodge", fill = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(plot2)
These plots show, that ironically, in the District of Columbia, there
are the most violent crime rates 7 - U.S MAP? ————————————– 1. Install
and load required libraries 2. Using the new package (leaflet), plus a
new dataset of US states, we can plot the US maps, and join the Guns
data to see the states with the most murders in the years
library(leaflet)
library(usmap)
library(dplyr)
plot_usmap(data = Guns, values = "murder", regions = "states") + #we use the old data set Guns to get the values of murder. New dataset to get the regions
scale_fill_continuous(low = "white", high = "red", name = "Murders", label = scales::comma) + #these are just commands to make the graph look pretty
labs(title = "Murder Rate by State",
subtitle = "Number of murders represented by color intensity.") +
theme(legend.position = "right")
plot_usmap(data = Guns, values = "prisoners", regions = "states") + #do the same thing but do prisoners!
scale_fill_continuous(low = "white", high = "red", name = "Murders", label = scales::comma) +
labs(title = "Prisoner Rate by State",
subtitle = "Number of prisoners represented by color intensity.") +
theme(legend.position = "right")
plot_usmap(data = Guns, values = "robbery", regions = "states") +
scale_fill_continuous(low = "white", high = "red", name = "Murders", label = scales::comma) +
labs(title = "Robbery Rate by State",
subtitle = "Number of robberies represented by color intensity.") +
theme(legend.position = "right")
Given the information thats given, lets revist the top state/place with
the highest rates for murder, robbery, and violence: Washington D.C.
WashingtonDC_Afac <- Guns %>%
select(state, africanAmerican) %>%
filter(state == "District of Columbia")%>%
summarise(average_afac = mean(africanAmerican, na.rm = TRUE))
WashingtonDC_Afac
## # A tibble: 1 × 1
## average_afac
## <dbl>
## 1 23.9
23.9 percent of state population if African American
WashingtonDC_cauc <- Guns%>%
select(state, caucasian)%>%
filter(state == "District of Columbia")%>%
summarise(average_cauc = mean(caucasian,na.rm = TRUE))
WashingtonDC_cauc
## # A tibble: 1 × 1
## average_cauc
## <dbl>
## 1 24.9
24.9 % of state population is Caucasian
Afac_plot <- ggplot(Guns, aes(x = "", y = africanAmerican, fill = state)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(fill = "State", y = "African American (%)", x = "") +
theme_void()
print(Afac_plot)
Given the pie chart, it shows that D.C has the most African Americans and murder rate, prisoner rate, and robbery rate. But are they correlated?
Cauc_plot <- ggplot(Guns, aes(x = "", y = caucasian, fill = state)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") + # This converts the bar chart to a pie chart
labs(fill = "State", y = "African American (%)", x = "") + # Add labels
theme_void() # This removes the background, gridlines, and text
print(Cauc_plot)
This pie chart shows that Vermont has the highest Caucasian rate,
however it is hard to tell. Lets find another way to find the state with
the most caucasians
state_order_caucasian <- Guns %>%
select(state, caucasian) %>%
arrange(desc(caucasian))
state_order_caucasian
## # A tibble: 1,173 × 2
## state caucasian
## <chr> <dbl>
## 1 Vermont 76.5
## 2 Vermont 76.1
## 3 Vermont 75.7
## 4 Vermont 75.2
## 5 Maine 75.1
## 6 Vermont 75.0
## 7 Maine 74.8
## 8 Vermont 74.7
## 9 Vermont 74.7
## 10 New Hampshire 74.7
## # ℹ 1,163 more rows
this shows the order of state with the most Caucasians. Vermont wins, with a wopping 76.5% rate in one year!
state_order_afac <- Guns%>%
select(state, africanAmerican)%>%
arrange(desc(africanAmerican))
state_order_afac
## # A tibble: 1,173 × 2
## state africanAmerican
## <chr> <dbl>
## 1 Hawaii 27.0
## 2 Hawaii 26.9
## 3 Hawaii 26.8
## 4 Hawaii 26.7
## 5 District of Columbia 25.9
## 6 District of Columbia 25.8
## 7 District of Columbia 25.8
## 8 District of Columbia 25.7
## 9 District of Columbia 25.6
## 10 District of Columbia 25.4
## # ℹ 1,163 more rows
We can do the same thing here, and we confirm that Hawaii has the most African American population with 27%, with D.C closely behind at 25.9%
D.C has the highest murder, prison, and robbery rate. It also has the highest population of African Americans.
But are they actually correlated?
We are going to use spearmans test. This is because it measures the strength and direction of association between two ranked variables. #we are also going to do income and murder first, as a practice run
spearman <- ggplot(data = Guns, mapping = aes(x = murder, y = income)) +
geom_point()
print(spearman)
A few observations about the plot:
Data Concentration: Most of the data points are clustered at the lower end of the murder count, which suggests that in the majority of the observations, the murder count is low.
No Clear Trend: There is no clear upward or downward trend visible in the scatter plot that would indicate a strong positive or negative correlation between income and murder count.
Outliers: There seem to be a few potential outliers with higher murder counts, but these do not appear to follow any trend with respect to income.
Okay to use
spearman_test <- cor.test(Guns$murder, Guns$income, method = "spearman")
## Warning in cor.test.default(Guns$murder, Guns$income, method = "spearman"):
## Cannot compute exact p-value with ties
print(spearman_test)
##
## Spearman's rank correlation rho
##
## data: Guns$murder and Guns$income
## S = 274459941, p-value = 0.4869
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02032023
The p-value is 0.4869, which is not below the common alpha level threshold of 0.05 for statistical significance. This high p-value suggests that you cannot reject the null hypothesis, which is that there is no monotonic association between the two variables.
Now we are going to do it for murder and African American percentage
spearman_assumptions <- ggplot(data = Guns, mapping = aes(x = africanAmerican, y = murder)) +
geom_point()
print(spearman_assumptions)
spearman_test <- cor.test(Guns$murder, Guns$africanAmerican, method = "spearman")
## Warning in cor.test.default(Guns$murder, Guns$africanAmerican, method =
## "spearman"): Cannot compute exact p-value with ties
print(spearman_test)
##
## Spearman's rank correlation rho
##
## data: Guns$murder and Guns$africanAmerican
## S = 69224664, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7426534
The positive rho value signifies that as the value of the africanAmerican variable increases, the number of murders tends to increase as well. The relationship is monotonic, which means that the variables tend to move in the same direction, but not necessarily at a constant rate across the entire range of values.
The p-value is less than 2.2e-16, which is essentially zero, and it indicates that the observed correlation is extremely statistically significant. This means that the probability of obtaining such a correlation by random chance is virtually non-existent, assuming the null hypothesis of no relationship was true.
That being said correlation doesn’t cause causation!
USAfacts.org posted an article on November 8, 2023, which showed the highest murder rates. In their article, they state “Although Washington, DC, had a higher homicide death rate (33.3 homicide deaths per 100,000 people) than every state, it’s not a state — given its population density, a fairer comparison is to counties in major metropolitan areas.” This is exactly what our data showed, which, like the article states, is somewhat of an unfair compariosn due to it’s population density.
https://usafacts.org/articles/which-states-have-the-highest-murder-rates/
That being said we can do one more test that compares population density with murder rate to give a more acurate comparison. First lets find the city with the highest population density
state_order_density <- Guns%>%
select(state, density)%>%
arrange(desc(density))
state_order_density
## # A tibble: 1,173 × 2
## state density
## <chr> <dbl>
## 1 District of Columbia 11.1
## 2 District of Columbia 10.9
## 3 District of Columbia 10.7
## 4 District of Columbia 10.1
## 5 District of Columbia 10.1
## 6 District of Columbia 10.1
## 7 District of Columbia 10.1
## 8 District of Columbia 10.1
## 9 District of Columbia 10.1
## 10 District of Columbia 10.1
## # ℹ 1,163 more rows
The population per square mile of land area is huge compared to other states… lets perform another statistical test to test this further satisty the requirments.
ggplot(Guns, aes(x =density, y = murder)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Population Density", y = "Murder Rate",
title = "Scatter Plot of Murder Rate vs Population Density")
## `geom_smooth()` using formula = 'y ~ x'
Lets change the scale to make it more even
ggplot(Guns, aes(x = density, y = murder)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_log10() +
labs(x = "Population Density (log scale)", y = "Murder Rate",
title = "Scatter Plot of Murder Rate vs Population Density (Log Scale)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Much better - There is a visible positive trend, as indicated by the
linear regression line. This suggests that there is a tendency for the
murder rate to increase with population density, at least within the
range of data points that are more densely packed.
Step 3: Regression Analysis
regression_result <- lm(murder ~ density, data = Guns)
summary(regression_result)
##
## Call:
## lm(formula = murder ~ density, data = Guns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.548 -3.364 -0.501 2.897 33.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2026 0.1505 41.20 <2e-16 ***
## density 4.1546 0.1075 38.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.99 on 1171 degrees of freedom
## Multiple R-squared: 0.5604, Adjusted R-squared: 0.56
## F-statistic: 1493 on 1 and 1171 DF, p-value: < 2.2e-16
Conclusion
Overall, the model suggests that there is a significant positive relationship between population density and murder rates. However, it’s important to note that correlation does not imply causation, and other factors not included in the model may influence murder rates.That being said, due to the multiple R-squared value being 0.5604, we know that 56% of the murder rate is based on popultion density.
In this data set we focused on many of the variables/factors that affected murder, prison, and robbery rate in the U.S. We broke down this dataset by calculating simple statistics from the data, then grouping variables to find new numbers based on certain years, states, and etc. We were able to do this by using the dplyr package to create new variables, manipulate variables, and filter them. It was succesfful as we were able to find the states with the highest murder rate etc. After this, we comapred all 51 states (D.C) with a few bar graph, using the ggplot package. Then, we were able to graph the whole of the US using the leaflet and U.S map package, to give a visual of the most states with said rates of crimes. Given this information, we were able to analyze certain states and explore correlations with variables within the data. We were able to conclude all of this information with a thorough lecture of integrating tidyverse, dplyr, readr, ggplot, leaflet, and the US map packages together to complete a detailed analysis of this dataset.
The key takeaway from this lecture is not so much the pretty graphs or the complicated test statistics, but more so the use and the functionality of the dplyr package. By using this package we were able to create new variables, manipulate current variables, and clean the data. Over 60% of the work of a data scientist is cleaning data, as most data sets aren’t pretty. This is a neccesary skill to have; whether it be in R, python, or stata, etc.
The source of this dataset is follows. Online complements to Stock and Watson (2007). References - Ayres, I., and Donohue, J.J. (2003). Shooting Down the ‘More Guns Less Crime’ Hypothesis. Stanford Law Review, 55, 1193–1312. Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.
```